WebChat Voice Channel

The WebChat Voice Channel enables natural voice interactions within the Druid web snippet, allowing users to switch effortlessly between typing and speaking. By using native or third-party STT and TTS services, Druid provides a responsive and accessible interface for web-based AI Agents.

Out-of-the-box speech services include:

  • Druid (available as a tenant feature in technology preview in Druid 9.20)

  • Microsoft Cognitive Services

  • ElevenLabs (TTS available starting with Druid 9.15 and STT available starting with Druid 9.18)

  • Deepgram (STT only)

  • Soniox (STT available starting with Druid 9.19)

  • Speechmatics (STT available starting with Druid 9.20)

To integrate a preferred speech provider not listed above, reach out to your Druid representative.

How the channel works

Once speech services are enabled, the interaction follows a streamlined flow:

  1. Users click the microphone icon in the chat snippet to begin speaking.
  2. The Speech-to-Text (STT) service processes the voice input in real time, displaying a live transcript in the input field.
  3. Once the sentence is complete, the AI Agent processes the text and generates a response.
  4. The response is displayed as text and simultaneously spoken back to the user via the Text-to-Speech (TTS) service.
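The flow above can be sketched as a small state machine. This is purely illustrative: the Druid snippet's internals are not published, and the class and method names below are hypothetical, not part of any Druid API.

```typescript
// Hypothetical sketch of the WebChat voice interaction flow.
// States mirror the four steps: idle -> listening -> processing -> speaking.
type VoiceState = "idle" | "listening" | "processing" | "speaking";

class VoiceSession {
  state: VoiceState = "idle";
  transcript = "";

  // Step 1: the user clicks the microphone icon.
  startListening(): void {
    this.state = "listening";
    this.transcript = "";
  }

  // Step 2: the STT service streams partial results into the input field.
  onSttResult(partial: string): void {
    if (this.state === "listening") this.transcript = partial;
  }

  // Step 3: the completed sentence is sent to the AI Agent.
  // Step 4: the reply is displayed as text and handed to the TTS service.
  finishUtterance(agent: (text: string) => string): string {
    this.state = "processing";
    const reply = agent(this.transcript);
    this.state = "speaking";
    return reply;
  }

  // When TTS playback ends, the session is ready for the next turn.
  onTtsFinished(): void {
    this.state = "idle";
  }
}
```

Note that steps 3 and 4 overlap in practice: the text response is shown while the TTS audio plays.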

Enabling Voice Interactions

Enabling voice interactions involves two steps: configuring one or more speech providers, then activating the speech services for the channel.

Step 1: Configure Speech Providers

Configure voice interactions directly within the WebChat channel settings:

  1. In the Druid Portal, navigate to your AI Agent and select the Channels tab.
  2. Search for 'webchat' and click the WebChat card. The channel configuration modal opens.
  3. At the top of the modal, click the tab for the speech provider you wish to configure.
  4. Configure the desired speech providers following the instructions in the sections below.

Setting up Druid-native speech

To use voice interactions with Druid-native speech services, you need the API key from your Druid representative.

NOTE: Druid-native STT/TTS is available in technology preview in Druid 9.20.

Setup procedure:

  1. In the channel configuration modal, click the Druid tab.
  2. Enter the details you received from your Druid representative.
  3. Map the languages your AI Agent supports to specific Druid voices in the configuration table.
    1. In the table, click the plus icon (+) to add a row.
    2. From the Language dropdown, select the AI Agent language (default or additional).
    3. From the Voice dropdown, select the specific voice the AI Agent will use to respond. The model is automatically filled in after you select the voice.
    4. Click the Save icon displayed inline.
    5. Repeat these steps for each AI Agent language you want to map.

  4. Click Save at the bottom of the page and close the modal.
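Conceptually, the table you fill in maps each AI Agent language to one voice (with the model resolved automatically from the voice). The sketch below expresses that mapping as data; the language codes, voice names, and model name are invented examples, not actual Druid identifiers.

```typescript
// Illustrative only: the language-to-voice mapping table, expressed as data.
// All identifiers here are hypothetical examples.
const voiceMap: Record<string, { voice: string; model: string }> = {
  "en-US": { voice: "Emma", model: "druid-tts-1" },
  "ro-RO": { voice: "Andrei", model: "druid-tts-1" },
};

// Look up the voice configured for a given AI Agent language, if any.
function voiceFor(language: string): string | undefined {
  return voiceMap[language]?.voice;
}
```

Languages without a row in the table have no mapped voice, which is where fallback providers (see Step 2 below) come into play.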

Setting up Microsoft Cognitive Services

IMPORTANT! To use voice interactions with MS Cognitive Services in production environments, contact your Druid representative for the necessary keys.
  1. In the channel configuration modal, click the Microsoft Cognitive Services tab.
  2. Enter the Key and Region provided by the Druid Support Team in the voice activation email.
  HINT: For demo purposes, you can request a test key from the Druid Tech Support team.
  3. Map the languages your AI Agent supports to specific voices in the configuration table.
    1. In the table below the Voice channel details, click the plus icon (+) to add a row.
    2. From the Language dropdown, select the AI Agent language (default or additional).
    3. From the Voice dropdown, select the specific voice the AI Agent will use to respond.
    4. Click the Save icon displayed inline.
  4. Save the configuration.

Setting up Deepgram

IMPORTANT! You can use Deepgram as a voice provider for WebChat in Druid 9.1 and higher, for Speech-to-Text (STT) only.
Prerequisites
  • You need a Deepgram API Key with Member Permissions. Refer to Deepgram documentation (Token-Based Authentication) for information on how to create a key with Member permissions.
Setup procedure
  1. In the channel configuration modal, click the Deepgram tab.
  2. Enter your Deepgram API Key.
  3. Map the languages your AI Agent supports to specific Deepgram models in the configuration table.
    1. In the table, click the plus icon (+) to add a row.
    2. From the Language dropdown, select the AI Agent language (default or additional).
    3. From the Model dropdown, select the specific Deepgram model the AI Agent will use to respond.
    4. Click the Save icon displayed inline.
    HINT: For Druid versions prior to 9.6, provide the Deepgram model (e.g., nova-2-medical). See Deepgram documentation for the complete list of models available.
  4. Save the configuration.

Setting up ElevenLabs

Druid supports ElevenLabs as a high-quality Text-to-Speech (TTS) and Speech-To-Text (STT) provider, enabling your AI Agent to communicate using specialized synthetic voices and custom voice clones.

NOTE: ElevenLabs is available as TTS provider starting with Druid 9.15 and as STT provider starting with Druid 9.18.
Prerequisites
  • You need an ElevenLabs API Key. To get an API key, go to https://elevenlabs.io/app/developers/api-keys and copy the key.
  • Make sure to grant the API Key Read permissions for the following endpoints:
    • Voices
    • Text to Speech
    • Speech to Speech
    • Speech to Text (for STT support)
    • Sound Effects
    • Audio Isolation

Setup procedure
  1. In the channel configuration modal, click the ElevenLabs tab.
  2. Enter your ElevenLabs API Key.
  3. Map the languages your AI Agent supports to specific ElevenLabs voices in the configuration table.
    1. In the table, click the plus icon (+) to add a row.
    2. From the Language dropdown, select the AI Agent language (default or additional).
    3. From the Voice dropdown, select the specific ElevenLabs voice the AI Agent will use to respond. The model is automatically filled in after you select the voice.
    4. Click the Save icon displayed inline.
  4. Click Save at the bottom of the page and close the modal.

Setting up Soniox

You can use Soniox as a Speech-To-Text (STT) provider for your AI Agent voice interactions. Its models natively support multiple languages and automatic language detection.

NOTE: Soniox is available as an STT provider starting with Druid 9.19.
Prerequisites
  • You need a Soniox API Key.
Setup procedure
  1. In the channel configuration modal, click the Soniox tab.
  2. Enter your Soniox API Key and select the model.
  3. Click Save at the bottom of the page and close the modal.

Setting up Speechmatics

Speechmatics is a speech-to-text provider available in the Voice Channel. It enables real-time and batch transcription using advanced automatic speech recognition (ASR) technology. It supports multiple languages and delivers accurate results across different accents and audio conditions, making it suitable for voice interactions and transcription scenarios.

NOTE: Speechmatics is available as an STT provider starting with Druid 9.20.
Prerequisites
  • You need a Speechmatics API Key.
Setup procedure
  1. In the channel configuration modal, click the Speechmatics tab.
  2. Enter your Speechmatics API Key.
  3. Save the configuration and close the modal.

Step 2: Enable Speech Services

Once the speech provider details are entered, you must explicitly activate them for the channel:

  1. In the WebChat configuration modal, click the General tab and scroll down to the bottom of the modal.
  2. Select the primary Speech-to-Text Provider. If you select a provider other than Azure, also select a Fallback Speech-to-Text Provider, which is used automatically if the primary provider does not support the user's language. Starting with Druid 9.18, both Azure and ElevenLabs can serve as the STT fallback provider; starting with Druid 9.20, Druid can as well.
  HINT: If Deepgram is the primary STT provider and Azure or ElevenLabs is the fallback, and the user selects a language unsupported by Deepgram, the system falls back to Azure/ElevenLabs. When the user switches back to a language Deepgram supports, the system automatically returns to Deepgram (the primary provider) for the remainder of the session.
  3. Select the primary Text-to-Speech Provider. If you selected ElevenLabs or Druid, also select a Fallback Text-to-Speech Provider, which is used automatically if the primary provider does not support the user's language.
  4. Click Save and close the modal.
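The fallback rule described above can be sketched as a simple selection function. This is an assumed model of the documented behavior, not Druid's actual implementation; the interface and function names are hypothetical.

```typescript
// Sketch of the primary/fallback STT selection rule (assumed behavior):
// use the fallback only while the primary does not support the user's
// language, and return to the primary as soon as it applies again.
interface SttProvider {
  name: string;
  supports(language: string): boolean;
}

function pickSttProvider(
  primary: SttProvider,
  fallback: SttProvider | undefined,
  userLanguage: string,
): string {
  if (primary.supports(userLanguage)) return primary.name;
  if (fallback !== undefined && fallback.supports(userLanguage)) return fallback.name;
  return primary.name; // no better option: stay on the primary
}
```

Because the function is evaluated per language selection, switching back to a language the primary supports automatically routes the session back to the primary provider, matching the HINT above.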